I have an EC2 instance set up with r3.8xlarge (32 cores, 244 GB RAM).
In my Spark application, I am reading two CSV files from S3 using spark-csv from Databricks; each CSV has about 5 million rows. I unionAll the two DataFrames and run dropDuplicates on the combined DataFrame.
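
Roughly, the job looks like this (a minimal sketch; the bucket, file names, and the readCsv helper are illustrative, not my exact code):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// minimal local setup for the sketch
val sc = new SparkContext(
  new SparkConf().setMaster("local[32]").setAppName("Raw Ingestion On Apache Spark"))
val sqlContext = new SQLContext(sc)

// read a CSV from S3 via the Databricks spark-csv package
def readCsv(path: String) =
  sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .load(path)

val df1 = readCsv("s3n://my-bucket/file1.csv")  // ~5 million rows
val df2 = readCsv("s3n://my-bucket/file2.csv")  // ~5 million rows

// union the two DataFrames, drop duplicate rows, and force an action
val deduped = df1.unionAll(df2).dropDuplicates()
deduped.count()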
But when I use the following configuration,
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setMaster("local[32]")
  .setAppName("Raw Ingestion On Apache Spark")
  .set("spark.sql.shuffle.partitions", "32")
Spark is slower than with .setMaster("local").
Wouldn't it be faster with 32 cores?